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PROBLEMS  AND  RESULTS 


This  contract  continued  ongoing  research  in  three  main  problem  areas:  (1)  diagnostic  and 
critical  data  analysis,  (2)  distribution  shapes  in  real  data,  and  (3)  computer  implementation  of  selected 
data-analytic  techniques. 

Diagnostic  and  Critical  Data  Analysis 

As  data  analysis  proceeds  in  an  exploratory  mode,  it  often  raises  questions  about  the  strength 
of  patterns  that  have  been  uncovered  or  about  individual  observations  that  seem  unus  lal.  Critical 
data  analysis  aims  to  deal  with  these  types  of  confirmatory  questions. 

During  the  period  of  the  contract  our  research  in  this  area  involved  outlier  detection,  robust 
analysis  of  variance,  regression  diagnostics,  robust/resistant  techniques  in  quality  control,  and  other 
applications. 

In  earlier  work  related  to  outlier  detection,  Hoaglin,  Iglewicz,  and  Tukey  (1986)  studied  the 
performance  of  a  standard  outlier-labeling  technique  from  exploratory  data  analysis.  This  rule  uses  the 
lower  fourth  and  the  upper  fourth  of  the  sample  (approximate  sample  quartiles)  as  the  basis  for 
the  cutoffs 


—  k(F^j  —  Fj^)  and  F^j  +  k(F^  — F^)  (1) 

with  k  =  1.5.  Any  observations  that  fall  outside  these  cutoffs  are  labeled  as  possible  outliers  and 
subjected  to  closer  scrutiny  whenever  feasible. 

This  work  measured  the  performance  of  this  and  other  rules  in  terms  of  the  some-outside  rate 
per  sample :  the  probability  that  a  sample  of  n  contains  at  least  one  observation  beyond  the  cutoffs.  By 
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design  the  basic  exploratory  rule  (k=1.5)  has  a  relatively  high  some-outside  rate  per  sample  for 
Gaussian  dafa:  roughly  between  15  percent  and  35  percent  as  n  ranges  from  5  to  50.  The  original 
definition  gives  the  fourths  in  terms  of  the  depth,  f  =  ^[(n+3)/2]  with  [•]  the  greatest-integer  function, 
and  the  sample  order  statistics  <  •••  <  according  to 

FL  =  X(f)  and  FU  =  X(n+1— f)  ‘ 

As  a  consequence  the  some-outside  rate  per  sample  increases  as  n  increases,  following  separate  but 
similar  curves  for  the  four  values  of  n  mod  4.  This  behavior  may  be  undesirable  if  one  does  not  insist 
on  a  simple  definition  of  f  that  facilitates  hand  calculation. 

Hoaglin  and  Iglewicz  (1987b)  studied  the  results  of  fine-tuning  such  outlier-labeling  rules  in 
two  ways.  First,  they  considered  two  forms  of  smooth  interpolation  for  the  fourths:  the  “ideal” 
definition  fj  =  n/4  +  (5/12)  and  an  alternative  based  on  a  common  definition  of  the  quartiles,  fq  = 
n/4  +  (1/4).  With  these  definitions  the  some-outside  rate  per  sample  follows  a  single  smooth  curve 
against  n.  Second,  to  permit  use  of  such  rules  for  more  traditional  outlier  detection,  they  determined 
values  of  the  constant  k  (for  fj,  fq,  and  the  standard  definition,  and  at  selected  sample  sizes  n  <  300) 
that  maintain  the  some-outside  rate  per  sample  at  .10  or  .05  for  Gaussian  data.  As  a  rough 
approximation  the  rule  based  on  fj  can  use  k  =  2.2  out  to  about  n  =  50  for  a  rate  of  .05.  For  a  rate  of 
.10  no  constant  value  of  k  seems  satisfactory;  a  crude  approximation  yields  k  ss  2.02  —  2.5/n  for  7  < 
n  <  52. 

In  addition  to  their  simple  form,  outlier-detection  rules  that  use  resistant  cutoffs  (as  in 
equation  (1))  have  the  convenient  property  that,  within  reasonable  limits,  the  user  does  not  need  to 
specify  the  maximum  possible  number  of  outliers.  The  limitation  arises  from  the  breakdown  point  of 
the  location  measure  and  scale  measure  used  in  specifying  the  cutoffs.  For  a  measure  of  location,  say, 
breakdown  involves  replacing  observations  in  the  sample  by  arbitrarily  extreme  values.  The 
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breakdown  point  is  the  smallest  fraction  of  such  contamination  that  can  cause  the  value  of  the  measure 
to  become  arbitrary.  Because  of  the  way  that  they  use  the  fourths,  the  cutoffs  in  equation  (1)  have  a 
breakdown  point  of  essentially  25%. 

One  can  also  start  with  a  rcbust/resistant  location  estimator  T  and  a  robust/resistant  scale 
estimator  S  and  then  use  these  to  define  an  outlier-labeling  rule  by  placing  cutoffs  at  T±kS  for  some 
suitable  positive  constant  k.  By  choosing  particular  T  and  S  that  have  good  efficiency  as  robust 
estimators,  one  hopes  to  produce  an  outlier-labeling  rule  (and  corresponding  outlier-detection  rule)  that 
performs  well  according  to  a  variety  of  criteria.  Hoaglin  and  Iglewicz  (1987a)  investigated  rules  based 
on  a  biweight  M-estimator  of  location,  Tbi>  and  a  biweight  A-estimator  of  scale,  s^j  (see,  for  example, 
Kafadar  (1983)  and  Lax  (1985)).  To  facilitate  comparison,  they  matched  the  population  values  of  the 
cutoffs  ±  k,sbi  *n  Gaussian  data  to  those  of  the  exploratory  rules  in  equation  (1)  by  taking  k*  = 
0.6745  +  1.349k.  The  resulting  some-outside  rate  per  sample  (estimated  by  simulation)  for  Gaussian 
data  is  about  11  percentage  points  lower  for  n  <  50  than  the  rate  for  the  standard  exploratory  rule  and 
about  7  percentage  points  lower  than  that  for  the  exploratory  rule  with  fj  as  the  depth  of  the  fourths. 
That  is,  when  the  data  are  ideally  well-behaved,  the  biweight  rule  less  often  labels  any  observations  as 
outside.  Hoaglin  and  Iglewicz  also  studied  other  characteristics  of  the  biweight  rules;  and  they 
determined  values  of  k  (again  at  selected  n  <  300)  that  correspond  to  a  some-outside  rate  per  sample 
of  .10  and  .05,  so  that  the  rule  can  be  used  for  outlier  detection. 

Thus  far  all  the  results  described  pertain  to  null  situations,  in  which  the  data  come  from  a 
Gaussian  distribution.  Hoaglin,  Iglewicz,  and  Tukey  (1986)  included,  as  alternative  null  situations, 
some  symmetric  distributions  with  heavier  tails,  but  none  of  the  data  have  contained  known  outliers. 
To  pursue  the  performance  of  various  rules  in  detecting  specified  outliers  (one  suitable  definition  of 
power  in  this  context),  Hoaglin  and  Iglewicz  (1988)  developed  a  class  of  fixed-outlier  models.  A  model 
of  this  type  begins  with  r  fixed  observations  and  completes  a  “sample”  of  n  by  taking  n  — r  random 
observations  from  the  standard  Gaussian  distribution. 
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For  r=l  and  the  fixed  observation  at  x,  one  determines  the  probability  that  a  rule  labels  the 
observation  at  x  as  an  outlier.  By  systematically  varying  x  one  obtains  the  rule's  performance  curve, 
Pn(x).  When  r=2,  the  performance  curve  generalizes  to  two  performance  surfaces.  If  and  x0  are 
the  fixed  observations,  Pni(xj>x2)  *s  t^ie  probability  that  the  rule  labels  Xj  an  outlier,  and  Pn2(xi'xo) 
is  defined  similarly  for  X2*  For  some  rules  the  task  of  studying  the  performance  surfaces  simplifies, 
because  the  cross-sections  at  X2  =  Xj,  X2  =  —  Xj,  and  X2  =  0  appear  to  capture  the  important  features 
of  Pnj  and  (by  symmetry)  P^.  Thus  one  plots  Pnj(x,x),  Pnj(x,—  x),  and  Pn^(x,0)  against  x.  This 
approach  provides  control  over  the  location  of  the  outliers;  and  it  is  convenient  for  simulation,  because 
one  can  generate  the  random  part  of  the  sample  and  then  vary  x  over  a  grid. 

Hoaglin  and  Iglewicz  used  the  models  with  1  and  2  fixed  outliers  to  compare  the  performance 
of  several  outlier-detection  rules  at  n=20  and  n=40.  They  considered  primarily  the  exploratory  rule, 
the  biweight  rule,  and  a  standard  test  based  on  the  sample  kurtosis.  (For  Gaussian  data  the  kurtosis 
test  has  certain  optimality  properties  when  the  outliers  arise  from  arbitrary  shifts  in  location  or  from 
arbitrary  increases  in  variance.)  AN  three  were  set  for  outlier  detection  at  the  .05  level.  At  both  n= 20 
and  n=40,  Pn(x)  for  the  biweight  rule  follows  that  for  the  kurtosis  test  fairly  closely,  whereas  the  curve 
for  the  exploratory  rule  drops  well  below’  for  x>3.  For  the  cross-section  PRj(x,0)  a  similar  pattern 
emerges,  as  one  would  expect  from  the  fact  that  the  observation  at  0  cannot  be  an  outlier.  The  other 
two  cross-sections  give  a  more  complicated  picture.  When  the  two  fixed  outliers  are  both  at  x  and 
n=20,  the  biweight  rule  and  the  exploratory  rule  perform  as  well  as  the  kurtosis  test  and  perhaps 
slightly  better  over  3  <  x  <  5.  At  n=40,  however,  the  kurtosis  test  outperforms  the  other  two  by  a 
substantial  margin  over  the  same  interval.  And  similar  patterns  of  domination  emerged  with  the  two 
fixed  outliers  at  x  and  — x,  both  at  n=20  and  n=40.  These  results  demonstrate  the  usefulness  of  the 
fixed-outlier  model  in  comparing  outlier-detection  rules.  Further  study  of  them  may  lead  to 
improvements  in  the  biweight  rule  or  suggest  other  robust/resistant  rules. 

Rocke  (1983)  studied  robust  estimation  of  variance  components  in  the  analysis  of  variance. 
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Using  a  substantial  simulation,  he  compared  the  performance  of  the  standard  estimators  of  variance 
components  with  those  of  two  robust  methods  (Huber  and  biweight)  in  the  two-way  mixed  model  with 
various  relative  sizes  of  between-means  variance,  various  sizes  and  frequency  of  outliers,  and  various 
sample  sizes.  Rocke  observed  that  for  small  between-means  variance  the  biweight  estimator  is  strongly 
biased  compared  to  the  Huber  estimator.  From  further  analysis  of  the  results  of  Rocke’s  simulations, 
Mosteller  found  that,  in  large  parts  of  the  space  of  simulations,  the  biweight  was  much  to  be  preferred 
in  spite  of  its  poor  performance  in  the  more  “null”  situation.  More  important,  in  much  of  the 
simulation  space  both  estimators  performed  poorly,  even  when  one  was  much  to  be  preferred.  This 
latter  finding  is  especially  troubling  because  the  basic  methods  underlying  these  two  estimators  have 
been  especially  successful  in  handling  outliers  in  the  one-sample  problem.  In  correspondence  with  John 
W.  Tukey  these  results  led  us  to  conclude  that  we  will  need  to  reformulate  the  goals  of  robust 
estimation  for  the  analysis  of  variance. 

Because  of  their  continuing  interest  in  regression  diagnostics  for  high-leverage  and  influential 
observations,  Hoaglin  and  Kemptborne  (1986)  contributed  discussion  on  a  review  paper  by  Chatterjee 
and  Hadi.  The  discussion  emphasized  cutoffs,  rules  of  thumb,  and  their  role  in  identifying  influential 
observations;  simple  residual  plots  that  display  high- leverage,  outlying,  and  influential  observations 
simultaneously;  approaches  to  uncovering  influential  groups  of  observations;  and  selection  of  subsets  of 
carriers.  It  also  sketched  a  step-by-step  diagnostic  strategy  that  should  be  useful  in  practice. 

As  a  basis  for  better  empirical  understanding  of  the  behavior  of  regression  diagnostics  in  large 
data  sets,  Hoaglin  obtained  (from  a  colleague  at  Abt  Associates  Inc.)  the  values  of  various  regression 
diagnostics  (e.g.,  diagonal  elements  of  the  hat  matrix,  DFITS,  DBETAS,  and  studentized  residuals) 
that  had  been  computed  for  two  sizable  regressions.  One  involved  n  =  1034  observations,  the  other 
had  5719,  and  both  had  p=21  explanatory  variables  (including  the  constant).  For  the  smaller  data  set 
he  analyzed  the  frequency  with  which  the  values  of  DFITS  and  DBETAS  •  •  -,  DBETAS21  fell 
outside  the  recommended  cutoff  values  (±2,|p/n  for  DFITS  and  ±2/Tn  for  DBETAS).  For  DFITS  the 
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percentage  was  6.0%,  whereas  for  the  DBETAS  it  ranged  from  0.6%  to  6.8%  (often  closer  to  4%  or  6% 
than  to  the  nominal  5%).  Normal  probability  plots  for  DFITS  and  one  of  the  DBETAS  revealed  that 
their  distributions  (across  the  observations  in  the  data  set)  had  substantially  heavier  tails  than  the 
Gaussian. 

Basic  techniques  for  diagnosing  leverage  and  influence  in  regression  could  be  applied  more  often 
than  they  so  far  seem  to  be.  As  one  step  in  the  process  of  speeding  the  spread  of  diagnostics  in 
everyday  applications  of  regression,  Hoaglin  (1988a)  gave  a  tutorial  account  of  leverage  and  influence  in 
simple  linear  regression. 

The  area  of  quality  control  offers  many  opportunities  to  apply  exploratory  and  robust 
methods.  Iglewicz  and  Hoaglin  (1987)  devised  a  control  chart  that  uses  boxplots  to  show  more 
information  than  the  customary  means  and  ranges  and  to  reveal  possibly  outlying  observations. 

In  an  application  of  robust/resistant  methods  to  anthropometrj ,  Himes  and  Hoaglin  (1989) 
investigated  techniques  for  smoothing  reference  data  on  human  growth.  For  such  measurements  as 
triceps  skinfold  thickness,  the  available  (cross-sectional)  data  usually  give  selected  percentiles  by  sex 
and  single  year  of  age.  By  working  with  the  sequences  of  differences  between  adjacent  age-specific 
percentiles,  a  variant  of  the  exploratory  technique  known  as  delineation  ensures  that  the  resulting 
smoothed  percentiles  maintain  the  proper  order.  In  an  example  Himes  and  Hoaglin  obtained  smoothed 
percentiles  that  provided  a  more  satisfactory  representation  of  the  raw  percentiles  than  did  the 
published  smoothed  percentiles,  which  researchers  at  the  National  Center  for  Health  Statistics  had 
produced  by  straightforward  use  of  cubic-spline  regression. 

For  data  that  might  be  handled  by  two-way  analysis  of  variance  with  one  observation  per  cell, 
diagnosis  of  multiplicative  structure  can  encounter  difficulties  if  not  done  resistantly.  In  its  most 
general  form  a  single  multiplicative  term  in  the  residuals  after  a  simple  additive  fit,  e^  =  y.j  — 
(m+aj+bj),  can  be  summarized  as  kcjdj.  Least-squares  fitting,  however,  allows  a  single  aberrant 
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observation  among  the  y-  to  perturb  the  residuals  in  a  pattern  of  the  form  kcjdj.  Thus  an  isolated 
anomaly  leaks  into  the  e-  as  a  systematic  pattern.  One  can  avoid  such  leakage  by  obtaining  the 
residuals  from  median  polish  or  another  suitably  resistant  analysis.  To  facilitate  diagnosis  of 
multiplicative  structure,  Hoaglin  refined  this  approach  by  using  the  graphical  display  known  as  the 
scatterplot  matrix.  One  examines  the  scatterplots  for  all  pairs  of  columns  (or  rows)  of  the  e^;  that  is, 
the  (s,t)  cell  in  the  matrix  contains  the  scatterplot  of  ejs  versus  e^,  one  point  for  each  i.  If  the  e- 
resemble  kc^dj,  then  the  (s,t)  scatterplot  will  tend  to  follow  a  straight  line  with  slope  ds/dt.  One  can 
readily  assess  the  possibility  of  leakage  problems  by  comparing  the  scatterplot  matrix  for  the  least- 
squares  residuals  with  one  for  resisiant  residuals.  Several  examples  have  given  encouraging  results. 

Distribution  Shape 

In  a  number  of  contexts  it  can  be  beneficial  to  develop  detailed  information  on  the  shapes  of 
distributions  that  tend  to  arise  in  real  data.  For  example,  data  in  a  particular  area  of  application  may 
suggest  a  need  for  a  robust  estimator,  and  one  might  like  to  base  the  choice  of  an  estimator  on  an 
assessment  of  the  ways  in  which  the  underlying  distribution  may  depart  from  an  ideal  shape  (usually 
t lie  Gaussian).  In  another  example  the  samples  may  be  large  enough  to  provide  a  reasonable  amount 
of  information  on  the  shape  of  the  distribution,  so  that  one  can  criticize  an  assumed  distributional 
model  and  adapt  the  model  to  the  data. 

As  part  of  our  research  on  distribution  shape  we  added  substantially  to  our  collection  of  data 
for  study.  The  data  sets  include  the  following: 

Opthamology  (spherical  and  cylindrical  refractive  errors,  n  =  510) 

Blood  pressure  (by  sex  in  three  age  groups,  n  ranges  from  534  to  965) 

Laboratory  measurements  (mostly  blood  chemistry)  on  subjects  in  two  treatment  groups  (n  ss  1300) 
Annual  incomes  reported  by  low-income  households  in  two  sites  (n  =  994  and  608) 

Climatic  variables  in  Texas  (15  variables  for  each  of  189  quadrats  covering  the  state) 
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Birth  weights  from  the  Medical  Birth  Registry  of  Norway:  first  and  second  births  to  women  who  had 
exactly  two  births,  both  singleton,  during  the  period  1975-1984  (by  the  sexes  of  the  two 
children  and  the  year  of  the  first  birth,  n  »  4000  per  stratum). 

For  the  last  of  the  data  sets  in  this  list,  the  Norwegian  birth  data,  Hoaglin  began  systematic 
analysis,  focusing  primarily  on  questions  related  to  low-weight  births.  He  used  a  variety  of  graphical 
techniques  including  quantile-quantile  plots,  along  with  numerical  techniques  based  on  the  g-and-h 
distributions  (see  below).  Initial  results  indicated,  among  other  things,  that  distributions  of  second- 
birth  weights  closely  resemble  a  mixture  of  a  Gaussian  distribution  and  a  small  percentage  of  lower- 
weight  births,  but  not  so  closely  as  previous  analyses  in  the  literature  have  assumed.  He  also  found 
that  adjustment  of  the  birth  weight  of  the  second  child  for  multiple  linear  regression  on  the  birth 
weight  of  the  first  child  and  the  gestational  age  of  the  second  child  produced  a  distribution  whose  shape 
is  much  closer  to  Gaussian.  These  shapes  showed  good  agreement  across  the  four  combinations  of  the 
sexes  of  the  two  children  and  across  strata  on  the  birth  weight  of  the  first  child.  The  findings  should 
contribute  to  the  discussion  of  the  phenomenon  of  repeated  low-weight  births. 

Focusing  on  another  aspect  of  distribution  shape,  Mosteller  gathered  a  substantial  number  of 
references  about  the  frequency  with  which  outliers  enter  real  sets  of  data  in  the  empirical  literature,  as 
well  as  about  fitting  the  tails  of  data  distributions.  In  further  work  he  plans  to  study  tail  behavior  by 
empirical  fitting  of  generalized  Pareto  distributions  (Smith,  1987). 

For  quantitative  description  of  distribution  shape,  we  often  use  Tukey’s  family  of  g-and-h 
distributions,  discussed  by  Hoaglin  (1985).  By  using  quantiles  the  basic  techniques  for  this  family 
provide  resistance,  and  they  offer  greater  flexibility  than  the  usual  third  and  fourth  moments.  In  terms 
of  a  standard  Gaussian  random  variable  Z  the  basic  random  variable  Y  of  the  g-and-h  family  (for  a 
fixed  value  of  the  skewness  parameter  g  and  the  elongation  parameter  h)  is  given  by 
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(2) 


Y  -  efLi  ehz2/2 

i  -  g  e 

One  can  readily  introduce  location  and  scale  parameters. 

To  provide  a  description  of  skewness,  for  example,  one  begins  with  the  median  y  ^  and  some 
selected  pairs  of  quantiles  yp  and  yj  at  tail  areas  p  (0<p<.5)  that  are  either  specified  in  advance 
(e.g.,  p  =  |,  j^,  ...)  or  suggested  by  the  data.  The  position  of  yp  and  yj  relative  to  y  ^ 

corresponds  to  a  value  of  g  (denoted  by  gp  or  gp)  according  to 

1  ,  yl-p_  y.5 

gp  =  -  loge  y/--yp  -  (3) 

where  zp  is  the  pth  quantile  of  the  standard  Gaussian  distribution. 

As  a  basis  for  our  empirical  work  on  distribution  shape,  we  needed  an  estimate  of  the 
variability  of  gp  from  a  set  of  data.  Thus  Hoaglin  and  Tukey  (1988)  developed  an  approximation  for 
var(gp).  Their  approach  applies  the  anglit  transformation  to  estimate  the  variance  of  yp,  j'j_p,  and 
y  tj.  The  data  may  come  either  as  order  statistics  or  as  a  frequency  distribution.  Hoaglin  used  these 
approximations  in  studying  skewness  in  the  Norwegian  birth  weight  data. 

Analyses  of  data  often  involve  a  transformation,  such  as  the  logarithm,  square  root,  or 
reciprocal,  and  these  nonlinear  transformations  affect  the  shape  of  distributions.  To  illustrate  the  point 
that  such  transformations  or  their  results  arise  in  everyday  life  more  often  than  most  people  realize, 
Hoaglin  (1988b)  collected  and  discussed  a  variety  of  nontechnical  examples,  including  magnitudes  of 
stars,  the  Richter  scale  for  earthquakes,  intensity  of  sounds,  average  speed  in  auto  races,  and  measures 
of  gasoline  consumption. 


9 


Computer  Software 


In  connection  with  our  work  on  new  techniques  of  data  analysis,  we  often  develop  computer 
software  that  implements  the  techniques.  From  the  start  the  design  of  such  programs  looks  beyond  an 
initial  version  for  personal  use  to  a  more  polished  version  that  could  readily  be  made  available  to 
others. 

During  the  period  of  the  contract  our  main  effort  in  this  area  focused  on  programs  that 
calculate  estimates  of  var(gp)  at  selected  values  of  p  for  data  to  which  one  is  applying  the  g-and-h 
distributions,  as  described  above.  Separate  versions  start  with  an  ordered  sample  or  a  frequency 
distribution.  Each  delivers  the  values  of  the  gp,  the  estimated  var(gp),  confidence  intervals  for  the  gp, 
and  estimated  covariances  of  the  gp  at  different  p. 

Occasionally  our  interest  in  software  leads  us  to  review  available  implementations  of  a 
particular  technique.  Frigge,  Hoaglin,  and  Iglewicz  (1987,  1989)  examined  the  implementations  (in 
Minitab,  S,  SAS,  SPSS,  Statgraphics,  and  Systat)  of  the  exploratory  display  known  as  the  boxplot, 
discussed  the  confusing  variety  of  nonstandard  definitions,  and  made  suggestions  for  improvements. 
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